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Unsupervised  Classification  System  for  Hvoerspectral  Data  Analysis 

Luis  O.  Jimenez,  Miguel  Velez,  and  Shawn  Hunt 
Laboratory  of  Applied  Remote  Sensing  and  Image  Processing 
University  of  Puerto  Rico  at  Mayaguez 

1.  Foreword 

Hyperspectral  remote  sensing  imaging  involves  collecting  hundreds  of  multiple,  contiguous,  narrowband,  spectra.  This 
represents  a  collection  of  data  nearly  two  orders  of  magnitude  greater  than  current  monochromatic  and  multispectral  imaging. 
Specific  applications  of  interest  to  the  DoD  include;  environment  -  pollution  detection;  geology  -  mineral  detection,  surface 
materials,  major  rock  types,  altered  rocks;  hydrology  -  water  quality,  point  and  no  point  pollution;  archaeology  -  further 
characterization  of  known  area,  localize  dig;  agriculture  -  type,  structure,  texture  and  moisture  of  soils; /ores/fy-  vegetation 
type  mapping,  quantification  of  biomass,  stress  detection;  oceanography  -  mapping  littoral  areas,  water  characteristics; 
marine  biology  -  characterizing  surface  environment;  endangered  species  -  characterizing  known  environment;  surveillance  - 
object  recognition,  target  identification. 

There  is  an  increasing  interest  in  remote  sensed  vision  systems  for  surveillance,  object  recognition,  target  identification,  and 
cartography.  Other  fields  that  could  benefit  from  such  systems  are  land  use  and  land  cover  administration,  estimation  of  water 
sedimentation,  and  the  creation  of  maps.  Of  particular  interest  are  automatic  systems  that  are  robust  with  respect  to  analyst’s 
knowledge  in  areas  such  as  image  analysis  and  computer  vision.  Remote  Sensed  image  analysis  focuses  on  obtaining 
information  from  radar,  multispectral  and  hyperspectral  images.  Hyperspectral  data  enables  the  analyst  to  detect  more 
materials,  objects,  and  regions  with  more  accuracy  than  previously  possible. 


As  the  number  of  bands  of  high  spectral  resolution  data  increases,  the  capability  to  detect  more  detailed  classes  should  also 
increase  and  the  classification  accuracy  should  increase  as  well.  The  curse  of  dimensionality  has  been  known  for  more  than 
three  decades.  There  is  a  need  for  the  development  of  algorithms  for  detection,  and  classification  that  utilize  the  amount  of 
information  and  separability  that  hyperdimensional  data  offers  while  simultaneously  avoiding  the  difficulties  inherent  in 
hyperdimensional  space. 

The  present  report  will  summarize  the  research  done  in  the  areas  of  clustering,  parameter  estimation  of  hyperspectral  data, 
band  subset  selection,  data  compression  and  unsupervised  decision  fusion  mechanism. 
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4.  Statement  of  Problem  Studied 

4.1  Unsupervised  Classification  System 

There  is  an  increasing  interest  in  remote  sensed  vision  systems  for  surveillance,  object  recognition,  target  identification,  and 
cartography.  Of  particular  interest  are  automatic  systems  that  are  robust  with  respect  to  analyst’s  knowledge  in  areas  such  as 
image  analysis  and  computer  vision.  Remotely  Sensed  image  analysis  focuses  on  obtaining  information  from  radar, 
multispectral  and  hyperspectral  images.  Hyperspectral  data  enables  the  analyst  to  detect  more  materials,  objects,  and  regions 
with  more  accuracy  than  previously  possible. 

As  the  number  of  bands  of  high  spectral  resolution  data  increases,  the  capability  to  detect  more  detailed  classes  should  also 
increase  and  the  classification  accuracy  should  increase  as  well.  Parallel  with  such  expectations  there  are  some  problems  that 
need  to  be  solved  before  being  able  to  retrieve  the  potentially  large  amount  of  information  from  hyperspectral  data.  The 
problems  that  need  to  be  solved  are  based  on  hyperdimensional  space  properties  and  their  implications  for  remote  sensing 
image  classification.  The  curse  of  dimensionality  has  been  known  for  more  than  three  decades.  In  combinatorial  optimization 
over  many  dimensions,  it  is  seen  as  an  exponential  growth  on  the  computational  effort  with  the  number  of  dimensions.  In 
statistics,  it  manifests  itself  as  a  problem  with  parameter  or  density  estimation  due  to  the  paucity  of  data.  Local 
neighborhoods  are  almost  surely  empty  of  dafta,  requiring  the  window  of  estimation  to  be  large  and  producing  the  effect  of 
losing  detailed  density  estimation.  This  implies  that  there  is  not  enough  data  to  produce  well  estimated  parameters  yielding 
rank-deficient  problems.  Most  of  present  detection  and  classification  algorithms  do  not  take  into  consideration  the  properties 
of  hyperdimensional  data.  What  is  called  for  is  the  development  of  algorithms  for  detection,  and  classification  that  utilize  the 
amount  of  information  and  separability  that  hyperdimensional  data  offers  while  simultaneously  avoiding  the  difficulties 
inherent  in  hyperdimensional  space. 

The  present  report  will  summarize  the  research  done  in  the  areas  of  clustering,  parameter  estimation  of  hyperspectral  data, 
band  subset  selection,  data  compression  and  unsupervised  decision  fusion  mechanism. 

4.2  Clustering  (Dr.  Luis  O.  Jimenez) 

Clustering  deals  with  the  task  of  splitting  a  set  of  data  points  into  a  number  of  more-or  less  homogeneous  classes  (clusters), 
with  respect  to  a  similarity  measure.  There  are  different  approaches  to  clustering  based  on  different  theoretical  systems. 
Among  the  different  cluster  approaches  are  statistical  approach,  fuzzy  clustering,  neural  network  clustering  and  fuzzy  logic- 
based  neural  network  clustering  to  mention  a  few  of  them.  In  this  research  project  we  explored  the  applications  of  Fuzzy  C- 
Means,  Neural  Network  clustering  and  Fuzzy  Neural  Network  to  hyperspectral  data  clustering  and  the  effect  of 
dimensionality  on  the  results. 
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4.2.1  Fuzzy  C-Means  Clustering 


The  Fuzzy  C-Means  clustering  algorithm  is  very  similar  to  the  Hard  C-Means  in  its  iterative  nature  and  in  the  computation  of 
the  cluster  centers.  On  the  other  hand  this  two  clustering  schemes  are  different  in  the  output  they  yield.  Hard  C-Means 
yields  well-defined  clusters  with  very  well  defined  characteristics,  and  belonging  to  a  cluster  implies  complete  exclusion  to 
the  others.  Fuzzy  C-Means  clustering  yields  a  partitioning  matrix: 


U  =  {i^}; 

(1) 

where  uu  is  the  grade  of  membership  of  the/*  data  point  to  the  i'*  cluster;  such  that , 

0<Ujj  <1 

(2) 

and. 

(3) 

where  C  is  the  total  number  of  fuzzy  clusters,  N  is  the  total  number  of  data  points  and 

2<C<N. 

(4) 

This  implies  that  each  data  point  belongs  to  every  cluster  but  with  a  degree  of  membership.  The  sum  of  the  membership 
grades  of  one  data  point  to  all  the  different  classes  is  normalized  to  be  equal  to  1 .  The  focus  of  the  algorithm  is  to  optimize  an 
objective  function,  which  acts  as  a  performance  index  of  the  clustering  algorithm.  The  form  for  the  objective  function  in  this 

algorithm  is: 

(5) 

To  perform  this  minimization  problem  we  first  differentiate  equation  (5)  with  respect  to  pj,  fixing 

Uij,  for  /  =  1,2,.,.,C  andy  = 

1,2,,,. ,N  and  then  we  differentiate  equation  (5)  with  respect  to  U|j  fixing  py,  for  /  =  1,2,...,C  and 

equation  (5),  thus  we  get: 

we  apply  the  conditions  in 

V-  1-'  r 

(6) 

and, 

'  ‘  ■  ■■^■--;V.  =  l....,C.j  =K..,N 

lim-hfr 

(7) 

k=l 


where  Xj  represents  the  data  point,  m  is  called  exponential  weight  and  it  influences  the  fuzziness  of  the  membership,  or 
partition  matrix.  Equation  (6)  and  equation  (7)  have  to  be  solved  using  an  iterative  method,  which  is  what  we  call  the  Fuzzy 
C-Means  (FCM)  Clustering  algorithm. 

The  steps  for  the  FCM  are: 

1.  Select  a  number  of  clusters  C  such  that  2  <  C  <  N  and  exponential  weight  m  such  that  1  <  iw  <  «>.  Choose  an  initial 
partition  matrix  most  of  the  time  since  we  have  no  a  priori  knowledge  of  the  data  characteristics  this  initial  partition 
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matrix  is  chosen  completely  arbitrary,  satisfying  equation  (3),  Also  choose  termination  criterion  e  and  set  iteration  index 

/-O. 

2.  Calculate  the  fiizzy  cluster  centers  using  and  equation  (6). 

3.  Calculate  the  new  partition  matrix  by  using  and  equation  (7). 

4.  Calculate  a  difference  relation  between  the  current  and  previous  iteration  objective 
function  values.  For  this  case  if: 


^-(J(M)-J(/))/J(/-l)>e.  (8) 

then  set  /  =  /  +  1  and  go  back  to  step  2.  If  ^  then  stop. 

4.2.2.  Neural  Networic  Clustering 

Neural  network  has  to  discover  by  itself  the  relationships  or  similarities  between  the  data  points.  There  are  different  types  of 
learning  rules  for  Neural  Network  unsupervised  learning.  The  one  used  in  this  project  is  called  the  competitive  learning  rule, 
or  winner  takes  alt  rule,  specifically  the  Kohonen  Learning  Rule.  Here  learning  is  based  on  the  clustering  of  the  input  data 
into  similar  groups  and  separation  of  dissimilar  data  points.  The  weights  (W)  of  this  network  can  be  viewed  as  the  centroid 
for  the  clusters  in  which  we  are  going  to  fit  the  data  points.  The  learning  for  the  Kohonen  Network  is  given  by: 

=  w*  +a  *  ~  w*  )  (9) 


where; 


(10) 


This  method  is  also  called  Vector  Quantization.  The  weights  of  this  system  can  be  viewed  as  unit  vectors.  The  system 
measures  the  cosine  of  the  angle  between  the  input  vector  and  the  weight  vectors,  and  compare  them.  The  weight  vector  that 
is  more  aligned  to  the  input  vector  wins,  and  is  updated,  the  other  weight  vectors  remain  unchanged.  The  system  keeps  on 
performing  this  task  until  the  changes  in  the  weight  vectors  reach  a  predetermined  minimum. 


4.2.3.  Fuzzy  Logic-Based  Neural  Network  Clustering 

Competitive  learning  is  one  of  the  major  clustering  techniques  in  neural  networks.  Making  a  generalization  of  the 
conventional  competitive  learning  rules  we  can  obtain  a  fiizzy  competitive  learning  rule  based  on  the  FCM  algorithm.  The 
fuzzy  competitive  learning  rule  is  derived  by  minimizing  the  objective  function  J  in  (5)  applying  the  Gradient  Descendent 
Method.  The  network  to  be  implemented  is  one  very  similar  to  the  Kohonen  Network  presented  earlier  with  the  exception 
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that  the  weights  in  this  Network  will  be  fuzzy  weight  vectors.  Again  this  weights  will  correspond  to  the  centroids  of  the  data 
clusters.  The  learning  rule  for  this  network  is  given  by; 


*+i 


=  (if  +  11Y  m  \xj  -  n,  ]:;■  =  1 «;  i  =  1 ,2,...,  C 


(11) 


where; 

(12) 

with  m  and  u  having  the  same  restrictions  as  in  the  FCM  algorithm.  We  can  see  that  this  corresponds  to  the  FCM  algorithm 
with  the  difference  that  instead  of  classifying  the  entire  data  set  and  then  updating  the  weights,  the  weights  are  updated  as 
each  input  vector  is  evaluated  by  the  system.  It  is  important  to  see  that  it  would  correspond  to  the  Crisp  Kohonen  Network  if 
the  fuzzy  exponential  weight  (m)  was  equal  to  one. 


4.3  Use  of  regularization  to  fix  the  parameter  estimation  process  (Dr.  Luis  O.  Jimenez) 

This  section  presents  the  effect  of  the  dimensionality  of  hyperspectral  data  in  the  estimation  of  the  covariances  and  its 
consequence  in  the  classification  process.  The  required  number  of  samples  to  train  a  quadratic  classifier  increases 
quadratically  with  the  number  of  features  (Jimenez  and  Landgrebe  1998).  This  is  critical  for  hyperspectral  data.  Due  to 
problems  in  the  covariance  matrix  estimation  the  data  points  in  a  normal  distribution  have  a  tendency  to  be  declared  as  an 
outlier  with  respect  to  the  training  samples.  The  definition  of  an  outlier  is  as  follows.  X  is  an  outlier  of  a  normal  distribution 
with  mean  |i|  and  covariance  Xj  if 


(13) 

Where  T  is  a  value  that  takes  into  consideration  the  dimensionality  of  the  data  and  is  related  to  the  Chi  Square  distribution.  In 
a  realistic  setting  using  statistical  pattern  recognition  the  mean  Pi  and  covariance  Xj  are  replace  by  the  maximum  likelihood 
estimates  that  are  computed  based  on  the  training  samples.  The  following  experiment  will  show  the  consequence  of  not 
having  an  adequate  number  of  training  samples  in  the  detection  of  outliers  in  hyperspectral  data 

Using  hyperspectral  data  from  the  AVIRIS  sensor  that  correspond  to  soil,  200  pixels  were  used  for  training  samples  and  200 
used  for  testing  samples.  We  rendered  an  experiment  where  we  choose  a  T  to  reject  the  1%  less  likely  of  the  data  as  outliers. 
In  Figure  I  we  observe  that  using  the  ML  estimate  of  the  covariance  around  1%  of  the  training  samples  were  rejected  as 
outliers.  This  results  is  consistent  with  our  expectations.  For  the  testing  samples  the  amount  of  data  points  that  are  classifiy  as 
outliers  increases  with  the  dimensionality  until  100%  were  declared  outlier.  When  we  used  the  identity  matrix  instead  of  the 
ML  estimate  the  problem  is  mitigated  for  testing  and  augmented  for  training.  For  a  well  estimated  covariance  matrix  only  1% 
of  the  training  and  testing  samples  should  be  declared  outliers. 
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Figure  la.  Number  of  outliers  in  testing  and  training 
samples  using  the  Maximum  Likelihood  covariance 
matrix. 


Figure  lb.  Number  of  outliers  in  testing  and  training 
samples  using  the  Identity  matrix  as  the  covariance 
matrix. 


The  main  purpose  of  this  part  of  the  investigation  was  to  apply  regularization  to  the  estimation  of  the  covariance  matrix  when 
the  size  of  the  training  samples  is  not  greater  than  the  dimension  d  of  the  feature  vector  or  if  is  not  appreciably  larger 

than  d.  The  regularization  schema  is  as  follows.  A  new  regularized  covariance  matrix  will  be  compute  using  the  Maximum 
Likelihood  estimate  from  the  training  sample  and  the  identity  matrix  as  is  expressed  in  equation  (14). 


(14) 

Where  Sf'^is  the  regularized  covariance  matrix,  is  the  maximum  likelihood  estimated,  is  the  corresponding 
identity  matrix,  d  is  the  dimension  of  the  feature  vector  [Duda,  Hart  and  Storic,  2001].  The  parameter  of  regularization  y  could 
have  values  between  0  and  1.  When  y  =  0  we  the  regularized  covariance  is  the  Maximum  Likelihood  estimate.  When  y  «  1 
the  regularized  covariance  is  equal  to  the  identity  matrix  multiply  by  a  constant.  The  effect  of  this  in  the  identification  of 
outliers  will  be  shown  in  the  results. 


4.4  Band  Subset  Selection  (Dr.  Miguel  Velez) 

Observations  from  hyperspectral  imaging  sensors  lead  to  high  dimensional  data  sets  from  hundreds  of  images  taken  at  closely 
spaced  narrow  spectral  bands.  High  storage  and  transmission  requirements,  computational  complexity,  and  statistical 
modeling  problems  combined  with  problem  prior  physical  insight  motivate  the  idea  of  dimension  reduction  using  band 
selection.  A  standard  approach  for  dimension  reduction  is  principal  component  analysis  [Geladi,  P.  and  H.  Grahn].  In  this 
approach,  the  original  hyperspectral  image  is  transformed  by  means  of  linear  transformations  into  a  set  of  uncorrelated  or 
orthogonal  “images.”  A  well-known  fact  for  multispectral  and  hyperspectral  images  is  that  most  of  the  spatial  information 
content  is  summarized  by  the  first  few  principal  components  f Geladi  and  Grahn] [Schowengerdt].  A  disadvantage  of  this 
approach  is  the  inherent  transformation  of  the  original  hyperspectral  image  (physical  meaningful  spectral  data)  into  linear 
combinations  of  bands  with,  in  many  instances,  little  or  no  physical  relation  with  the  spectral  information  content  on  the 
original  image. 


9 


An  alternative  dimensionality  reduction  approach  that  keeps  the  physical  insight  from  spectral  knowledge  is  band  subset 
selection.  In  band  subset  selection,  we  select  a  subset  of  bands  that  in  some  sense  summarizes  most  of  the  information 
contained  in  the  original  hyperspectral  data  cube.  Band  selection  can  be  based  on  physical  insight  or  optimization  of  some 
information  content  or  of  class  separability  measure  [Pdsztor  and  Csillag][Price][San  Miguel-Ayanz  and  Biging][Van  den 
Broek  et.  al].  From  a  classification  point  of  view,  band  selection  is  a  feature  subset  selection  problem.  By  using  band 
selection,  we  can  reduce  not  only  the  cost  of  recognition  by  reducing  the  number  of  bands  that  need  to  be  collected  (as  in 
reconfigurable  sensors  or  in  sensor  design),  but  in  some  cases  it  can  also  provide  better  classification  accuracy  due  to  finite 
sample  size  effects  [Jimenez  and  Landgrebe,  98] [Jimenez  and  Landgrebe,  2000][Zongker  and  Jain]. 

The  optimal  band  subset  selection  problem  is  a  combinatorial  optimization  problem  requiring  the  use  of  search  methods  to 
solve  it  The  main  objective  of  this  research  work  was  the  development  of  fast  and  reliable  algorithms  for  band  selection.  In 
this  research,  we  developed  band  selection  methods  based  on  subset  selection  methods  presented  in  [Golub  and  Van  Loan, 
1997]  to  determine  the  subset  of  most  independent  columns  in  a  matrix.  Two  approaches  were  studied  based  on  singular 
value  decomposition  (SVD)  and  rank  revealing  QR  (RRQR)  factorization.  In  our  work,  we  evaluated 

•  which  one  of  the  potential  algorithms  worked  better  in  our  application.  We  showed  in  [Velez-Reyes,  et  al,  2000]  that 
band  selection  based  on  SVD  was  more  robust  than  band  selection  using  RRQR  in  terms  of  closeness  to  the 
principal  components. 

•  comparison  of  the  SVD  method  to  other  methods  that  try  to  approximate  the  principal  components  presented  in  the 
literature.  We  showed  in  [Velez-Reyes  et  al,  Sept.  2001]  that  SVD-based  band  selection  was  much  better  in 
selecting  an  approximation  than  the  other  methods  because  it  deals  with  the  band  selection  in  a  multivariable  fashion 
compared  to  the  other  approaches  that  use  a  greedy  approach. 

Detailed  presentation  of  these  results  is  given  in  [Velez-Reyes  and  Jimenez,  Seattle  98][V61ez-Reyes  et  al,  San  Diego 
97] [Velez-Reyes  et  al,  Orlando  2000] [Velez-Reyes  and  Linares,  Sept.  2001]  (copies  included).  Applications  of  the  band 
selection  methods  developed  in  this  research  to  classification,  compression  and  biomedical  applications  are  presented  in 
[Jimenez-Rodriguez  et  al.  Sept.  2001][Jimenez  et  al,  Orlando  April  1999][Hunt  and  Velez-Reyes] [Rodriguez-Diaz  et  al,  July 
2000] 

In  the  following  sections,  we  will  discuss  the  general  problem  of  dimension  reduction  and  then  formulate  the  band  selection 
problem  and  motivate  the  idea  of  approximating  principal  components  as  a  way  to  select  bands.  We  then  present  a  summary 
of  the  results  of  this  research  project.  The  reader  is  referred  to  the  publications  for  more  details. 

4.4.1  Band  Subset  Selection  and  Optimal  Dimension  Reduction 

From  a  statistical  modeling  perspective,  as  the  number  of  bands  increases,  the  discrimination  capacity  of  hyperspectral  data 
should  increase  as  well.  However,  the  need  for  training  samples  for  a  classifier  can  increase  exponentially  with  the  number  of 
bands  depending  on  the  classifier  being  used  [Chan  and  Hansen].  This  is  the  so-called  curse  of  dimensionality.  Therefore,  it 
is  of  interest  to  develop  methodologies  to  reduce  the  dimensionality  of  the  hyperspectral  image  data  but  at  the  same  time 
retain  as  much  as  possible  of  their  class  discriminatory  information. 
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Dimension  reduction  is  the  general  problem  of  projecting  a  data  set  into  a  lower  dimensional  space.  Let  x  be  an  n* 
dimensional  random  vector  (a  pixel)  with  zero  mean  and  covariance  matrix  !£x-  We  wish  to  consider  a  linear  dimension^ 
reducing  transformation  of  x  to  a  random  variable  y  given  by 

y  =  A^x 

where  A  is  a  nxp  matrix  with  p<n  and  A^A=Ip  {orthonormal).  Thus  y  is  a  p-dirnensional  random  variable.  From  a 
classification  perspective,  we  would  like  to  select  the  A  matrix  such  that  in  the  lower  dimensional  space  the  classes  are  as 
separated  as  possible  leading  to  improve  performance  of  the  training  algorithms  and  the  classifier. 

The  selection  of  the  appropriate  projection  matrix  A  can  be  formulated  as  a  constrained  optimization  problem 

max®(A,x)  (15) 

A**^A=I 

where  X  is  the  image  and  4>  is  a  properly  selected  objective  function.  The  selection  of  the  objective  function  is  a  key  step  in 
the  design  of  the  dimension  reduction  algorithm.  It  is  desired  to  select  the  cost  function  <I>  that  will  result  in  a  projection  with 
large  statistical  separability  and  small  class  variability.  Another  alternative  to  select  <I>  is  to  keep  as  much  as  possible  of  the 
spatial  information  available  on  the  image.  For  this  later  case,  principal  components  are  the  optimal  solution. 

A  dimension  reduction  problem  of  particular  interest  for  applications  where  hyperspectral  sensors  can  be  used  is  band  subset 
selection.  The  band  subset  selection  problem  can  be  framed  in  framework  discussed  previously  by  further  restricting  the 
projection  matrix  A  as  follows 

where  P  is  a  permutation  matrix.  The  net  effect  of  this  constraint  is  that  the  dimension-reducing  transformation  A  now  selects 
a  subset  of  p  of  the  original  variables  x  as  follows 

y  =  A\  =  [l,  0]P’'X  = 

The  selection  of  a  subset  of  bands  have  several  interesting  advantages  compared  to  more  general  projections  since  we  are 
retaining  the  physical  meaning  of  the  data  (no  combinations)  in  order  to  (a)  maximize  human  understanding,  (b)  combine 
spectral  data  with  other  data  types,  and  (c)  exploit  physical  modeling/simulation.  Also  this  information  can  be  used  in 
optimizing  sensor  design. 

The  selection  of  the  p-optimal  bands  leads  to  a  combinatorial  optimization  problem  with  a  very  large  dimension  solution 
space.  For  instance,  selecting  10  out  of  224  bands  (as  in  AVIRIS)  results  in  searching  approximately  7,148x10*® 
possibilities.  This  problem  can  be  tackled  using  standard  search  mechanisms  for  combinatorial  optimization  problems  that 
are  quite  time  consuming.  The  researchers  have  proposed  an  automated  band  selection  algorithm  [V61ez-Reyes  and  Jimenez, 


Xi 
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Settle  98][Velez-Reyes  et  al,  San  Diego  97][Velez-Reyes  et  al,  Orlando  2000]  that  is  based  on  algorithms  to  select  the  most 
independent  columns  of  a  matrix.  The  determination  of  the  most  independent  columns  of  a  matrix  is  still  a  combinatorial 
optimization  problem.  However,  good  approximated  solutions  based  on  singular  value  decomposition  (SVD)  [Golub,  G.  and 
C.F.  Van  Loan]  and  rank  revealing  QR  (RRQR)  factorizations  [Chan  and  Hansen]  are  presented  in  the  linear  algebra 
literature.  It  can  be  shown  [Golub,  G.  and  C.F.  Van  Loan][Chan  and  Hansen],  that  bands  selected  using  this  method,  under 
certain  circumstances,  are  a  good  approximation  to  the  principal  components  and,  therefore,  will  summarize  most  of  the 
spatial  variability  information  contained  in  the  data  cube  [Velez-Reyes  and  Jimenez,  Settle  98][  Velez-Reyes  et  al,  San  Diego 
97].  The  reduced  computational  time  is  achieved  because  these  matrix  decompositions  that  can  be  computed  in  polynomial 
time, 

4.4.2  Principal  Component  Analysis  of  Hyperspectral  Imagery 

Let  X  be  the  hyperspectral  image  arranged  in  matrix  form  with  singular  value  decomposition  (SVD) 

X-UXV*^ 

where  U=*Iui,  U2, ....  Un]  and  V“|vi,  V2, vj  are  orthonormal  matrices  of  the  left  and  the  right  singular  vectors  respectively, 
2=diag{ori,  02, Cn}*  and  Oi  is  the  i-th  singular  value  of  X.  The  singular  values  are  ordered  according  to  magnitude  CTi  >  02 
^  On.  The  matrix  X  is  an  mxn  matrix  where  n  is  the  number  of  bands  and  m  is  equal  to  the  total  number  of  pixels  in  the 
image.  The  columns  of  the  matrix  X  are  constructed  by  stacking  the  columns  of  the  image  at  each  channel. 

The  j-th  principal  component  is  given  by 


PCf=OjUj-Xvj.  j=l,2....,n 


The  percentage  of  variability  that  is  explained  by  each  individual  principal  component  is  given  by 


%Var, 


jHh  component  n 


(16) 


For  hyperspectral  data,  due  to  redundancy,  it  is  often  the  case  that  at  one  or  many  points  in  the  singular  value  spectra  that  Op 
» cip+i  for  some  p<n  which  implies  that  the  matrix  X  is  near  rank  deficient.  Therefore,  most  of  the  spatial  information  about 
the  hyperspectral  image  is  summarized  in  a  p-dimensional  subspace.  It  turns  out  that  in  most  hyperspectral  applications  p«n 
resulting  in  a  significant  reduction  of  dimensionality.  For  later  reference,  we  will  denote  by  Ui  the  matrix  formed  by  the  first 

p  left  singular  vectors,  i.e.U  =  [lJ,  U2],  The  total  percentage  of  variability  explained  by  the  first  p  principal 


components  is  given  by 


%Var  = 
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Data  reduction  using  principal  components  is  achieved  by  transforming  the  original  image  into  a  set  of  orthogonal  images 
images  (or  principal  components),  and  keeping  only  the  first  p.  A  clear  disadvantage  of  this  approach  as  shown  by  (16)  is  the 
inherent  transformation  of  the  original  hyperspectral  image  X  (physical  meaningful  spectral  data)  into  linear  combinations  of 
bands  with,  in  many  instances,  little  or  no  physical  relation  with  the  spectral  information  content  on  the  original  image. 

4.4.3  Matrix  Factorization  Based  Band  Subset  Selection 

To  simplify  our  presentation,  we  will  reformulate  the  band  subset  selection  problem  as  the  problem  of  reordering  the  columns 
of  the  matrix  X  using  a  permutation  matrix,  P.  Let  X  be  the  matrix  generated  by  multiplying  X  by  P 

X  =  XP  = 

Notice  that  the  only  difference  between  X  and  X  is  the  ordering  of  their  columns.  The  problem  of  band  selection  can  be 
posed  as  finding  a  permutation  matrix  P  such  that  the  mxp  matrix  X,  has  the  bands  with  the  desired  properties. 


X, 

setectrd 
^bmds 


Notice  that  the  singular  value  decompositions  of  X  and  X  are  related  by 

X  =  XP  =  USV^P  =  us(p'^vf  =  usv^ 


(17) 


so  the  permutation  matrix  only  has  an  effect  on  the  order  of  the  right  singular  vectors  and,  hence,  both  matrices  have  the  same 
principal  components.  A  question  of  importance  in  this  analysis  is  in  what  sense  X|  can  be  a  good  approximation  to  Ui.  In 
order  to  evaluate  how  close  the  selected  bands  are  to  the  principal  components,  we  decided  to  use  the  canonical  correlation 
between  the  selected  bands  and  the  corresponding  principal  components.  Canonical  correlation  is  a  multidimensional 
extension  of  the  cosine  of  the  angle  between  two  vectors  to  the  cosine  of  the  angle  between  two  subspaces  [Wickens].  In  our 
application,  there  are  two  spaces  of  interest,  the  p-dimensional  space  generated  by  the  selected  bands  and  the  p-dimensional 
space  generated  by  the  first  p  principal  components.  The  canonical  correlation  is  the  cosine  of  the  smallest  angle  between 
vectors  generated  by  all  possible  linear  combinations  of  the  selected  bands  and  vectors  generated  by  all  possible  linear 
combinations  of  the  principal  components.  This  angle  in  a  sense  measures  the  similarity  between  the  two  subspaces.  We 
think  this  is  the  best  measure  of  proximity  between  selected  bands  and  principal  components. 

An  important  relation  between  this  angle  and  the  singular  values  of  x,  and  X  is  presented  next  [Golub  and  Van  Loan] [Chan 
and  Hansen]. 

sin0(<R(U,),9l(X,))<^^^ 
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where  X(  A)  is  the  range  space  and  ai(A)  is  the  i-th  singular  value  of  matrix  A.  Ideally,  we  would  like  that  both  spaces  to  be 
aligned  which  would  imply  that  sin0=O.  A  sufficient  condition  for  this  to  happen  would  be  that  a^(x,)»a^,^,(X)-  h  is 
shown  in  [Golub  and  Van  Loan][Chan  and  Hansen]  that  this  later  condition  will  occur  if  a  gap  condition  is  met  in  the  original 
HSI  data  (i.e.  o  (X)»o  +,(X))  and  a  set  of  sufficiently  uncorrelated  bands  is  selected  which  will  guarantee  that 

o  (Xj)«a  (X)-  "fhis  is  the  approach  we  followed  in  the  band  selection  algorithms  presented  in  [V61ez-Reyes  and  Jimenez, 
Seattle  98][V61ez-Reyes  et  al,  San  Diego  97][Velez-Reyes  et  al,  Orlando  2000]. 


To  try  to  meet  the  condition  a^(x,)«a^(X)»  we  can  use  the  following  result  from  [Golub  and  Van  Loan]  that  the  p-th 
singular  value  of  X  and  of  X,  are  related  by 

o,(X)Sa,(X,)>j2^  (18) 

where  V„  is  the  upper  pxp  bloek  of  V .  To  obtain  a  sufficiently  independent  set  of  columns  (bands),  the  permutation  matrix 
P  must  be  chosen  such  O  f  V.T'  )  *  1 .  The  SVD  and  RRQR  subset  selection  algorithms  try  to  determine  good  choices  (not 
necessarily  optimal)  for  P. 


4.4.3. 1  SVD  Subset  Selection 
Notice  that 


the  leftmost  inequality  comes  from  the  fact  the  V,  is  an  orthonormal  matrix  and  the  interlacing  property  of  singular  values 
[Golub  and  Van  Loan],  If  V,,  is  well-conditioned  (i.e.  k(V„)  =  or^(  V„)/o„j„(y,)  “  1).  then  )  =  1  - 


A  heuristic  solution  [Golub  and  Van  Loan]  to  this  problem  can  be  obtained  by  computing  the  QR  with  column-pivoting 
factorization  of  the  matrix  V|  that  is  composed  by  the  first  p-right  singular  vectors  of  X.  To  understand  the  reasoning  behind 
this  heuristic  algorithm,  let 

V,^P  =  [v,I  ^J  =  Q[R„RJ 

where  Q  is  orthonormal,  P  is  the  permutation  matrix,  and  Rn  is  an  upper  triangular  matrix,  be  the  desired  factorization. 
Notice  that  Rn  is  nonsingular  and  that  V,T=QR„  and  therefore  K (  V, ,)  =  K (R ,,) .  Heuristically,  [Golub  and  Van  Loan] 
column  pivoting  tends  to  produce  a  well-conditioned  Rn  (i.e.  k(R,|)  *1)  and  so  the  overall  process  tends  to  produce 
.  Thus  we  obtain  the  SVD  band  selection  algorithm  based  on  the  SVD  subset  selection  algorithm  in  [Golub 
and  Van  Loan]  and  is  summarized  next. 
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Algorithm:  SVD  Subset  Selection 

1.  Construct  a  matrix  representation  X  of  the  hyperspectral  image.  Each  column  of  X  is  constructed  by  stacking 
the  columns  of  each  single-band  image. 

2.  Subtract  the  mean  from  each  band. 

3.  Compute  the  SVD  of  X. 

4.  Compute  the  QR  factorization  with  pivoting  of  the  matrix  Vj^  where  V  i  is  formed  by  the  first  p  right  singular  vectors  of 
X.  Save  the  pivot  matrix  P  that  results  from  this  factorization. 

5.  Compute  X  =  XP . 

6.  Take  the  first  p  columns  of  X  as  the  selected  bands. 

4.4.3.2  Subset  Selection  using  the  RRQR  Factorization 

As  suggested  in  [Chan  and  Hansen],  RRQR  can  be  used  to  determine  the  most  independent  columns  of  the  matrix  X.  The 
motivation  to  look  into  RRQR  is  because  SVD  subset  selection  algorithm  requires  the  computation  of  the  SVD  of  X,  which  is 
a  computational  intensive  procedure  when  compared  to  the  computation  of  RRQR  factorization.  For  and  mxn  matrix,  the 
number  of  flops  in  the  computation  of  the  SVD  is  in  the  order  of  mn^  while  for  the  RRQR  factorization  is  in  the  order  of  mn^ 
[Golub  and  Van  Loan].  The  reasoning  behind  this  approach  is  described  next. 

Let 


XP  =  [X,  xJ  =  Q 


where  Rh  and  R22  are  pxp  and  (n-p)x(n-p)  upper  triangular  matrices  respectively,  be  a  QR  with  pivoting  factorization  of  X. 
Notice  that 


X,=Q 


and  therefore  Gi{  Xj  CTi(Ri  1).  From  (1 8),  we  get  that 

so  that 

o'p(Ru).  1 

a,(X)  “aJV) 


(19) 


(20) 
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Therefore  by  selecting  P  such  that  ap(Rii)«ap(X),  we  get  a  subset  X,  with  the  desired  linear  independence  properties. 
Trying  to  make  <yp(Rii)«ap(X)  is  the  objective  in  the  pivoting  strategies  of  the  Type  I  algorithms  for  the  computation  of  the 
RRQR  factorization  [Chandrasekaran  and  Ipsen].  The  RRQR  band  selection  algorithm  presented  next  is  based  on  the  RRQR 
factorization  presented  in  [Chan  and  Hansen] [Chandrasekaran  and  Ipsen]. 


Algorithm:  RRQR  Band  Selection 

1)  Construct  a  matrix  representation  X  of  the  hyperspectral  image.  Each  column  of  X  is  constructed  by 
stacking  the  columns  of  each  single-band  image. 

2)  Subtract  the  mean  from  each  band. 

3)  Compute  the  QR  with  pivoting  factorization  XP=QR  of  X. 

4)  LetR<®^=R,andP^®^-P 
a)  For  1=0 top- 1  do 

i)  Compute  the  largest  singular  value  and  corresponding  right  singular  vector  of  R^‘\ 

ii)  Find  the  permutation  matrix  that  permutes  the  largest  element  in  absolute  value  of  to  the  top. 

iii)  Apply  this  permutation  to  the  columns  of  R^*^  and  the  last  n-1  columns  of  P^'^  to  form 
Retriangularized  R^”^  without  column  pivoting.  Set  equal  to  the  matrix  resulting  from  the 
elimination  of  the  first  column  and  first  row  of  the  retriangularized  R^'\ 

5)  Compute  X  = 

6)  Take  the  first  p  columns  of  X  as  the  selected  bands. 


The  RRQR  factorization  is  a  refinement  of  the  QR  with  pivoting  factorization  that  tries  to  achieve  ap(Rn)»  <Tp(X).  Since  R^‘^ 
is  an  upper  triangular  matrix,  its  largest  singular  value  and  singular  vector  can  be  easily  estimated  therefore  not  requiring  the 
computation  of  its  full  singular  value  decomposition  at  each  step. 


4.5  Lossless  Compresion  Method  for  hyperspectral  data  (Dr.  Shawn  Hunt) 

AVIRIS  images  are  hyperspectral  images  consisting  of  224  bands  in  the  visible  through  infrared  wavelengths,  380  to  2500 
nanometers.  Each  spectral  band  is  contiguous  and  approximately  10  nanometers  wide.  Because  of  this,  one  band  is  generally 
highly  correlated  with  its  neighbors.  The  precision  of  each  element  depends  on  the  amount  of  processing  done.  The  raw  data 
is  12  bits  (since  1995,  10  bits  before  that),  but  the  precision  increases  as  the  processing  is  done.  Data  that  has  been 
geometrically  and  radially  corrected  is  distributed  in  16  bit  format. 

The  large  number  of  bands  in  hyperspectral  images  means  that  they  will  be  close  spectrally,  and  thus  have  a  large  correlation. 
This  same  fact  makes  the  selection  of  a  subset  of  bands  difficult  because  of  the  large  number  of  combinations  possible.  For 
example,  if  we  are  to  select  two  bands  to  predict  the  others  we  have  24,976  possible  combinations.  If  we  are  to  select  three 
bands,  the  number  of  combinations  becomes  1,848,224.  An  exhaustive  search  then  becomes  impractical  for  all  but  the 
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smallest  problems.  A  lengthy  search  may  be  tolerated  if  it  is  only  to  be  done  once  for  a  large  number  of  images^  or  if  the 
compression  results  are  changed  drastically.  In  the  case  presented  here,  however,  at  least  six  bands  are  neede  to  achieve 
compression  comparable  to  using  one  contiguous  band,  as  shown  later.  The  1,64*10**  possible  combinations  make  it 
impossible  to  do  an  exhaustive  search,  and  another  method  must  be  found. 

Which  subset  of  the  224  bands  should  be  used  to  predict  the  rest?  Since  our  goal  is  to  have  simple  codes  and  high 
compression,  we  want  to  know  which  elements  should  be  used  as  input  to  the  predictor  in  order  for  the  prediction  error  to 
have  a  low  first  order  entropy.  A  direct  measure  of  this  can  be  done  using  conditional  entropies.  For  example,  the  conditional 
entropy  of  x  given  y,  H(x|y),  is  the  average  number  of  bits  needed  to  code  x  given  thaty  is  known,  where 


H(.x\y)  =  |  y*)).  (21) 

i  k 

Thus,  if  we  were  to  choose  between  y  and  z  for  predicting  x,  we  would  choose  the  one  that  gives  the  lower  conditional 
entropy. 

For  the  band  selection  problem,  say  we  want  to  know  which  set  of  M  bands  are  optimal  for  predicting  the  others.  This  can  be 
done  as  follows.  The  conditional  entropy  of  each  band  given  a  set  of  A/  bands  is  calculated.  This  is  then  repeated  for  all 
possible  combinations  of  A/ bands.  The  set  of  bands  that  give  the  lowest  average  entropy  are  optimal. 

The  problem  with  calculating  the  conditional  entropies,  as  mentioned  above,  is  time  complexity.  When  using  AVIRIS 
images,  the  large  number  of  bands  and  high  precision  becomes  significant.  Going  back  to  the  two  band  case,  224  entropies 
are  calculated  for  each  of  the  24,976  possible  combinations  for  a  total  of  5.6  million  entropies.  To  calculate  these  conditional 
entropies,  conditional  densities  (or  equivalently  joint  densities)  are  estimated.  For  the  two  band  case,  third  order  joint 
densities  are  needed.  If  the  data  is  represented  by  12  bits,  it  can  take  any  one  of  4096  possible  values.  Third  order  joint 
densities  means  4096^  =  6.9*  10*®  probabilites  must  be  estimated.  Further  agravating  the  problem  is  the  large  amount  of  data 
needed  for  good  estimates. 

A  conditional  entropy  can  be  viewed  as  the  entropy  of  the  prediction  error  if  an  ’optimal’  predictor  were  used.  As  calculating 
the  optimal  prediction  bands  using  entropy  is  not  tractable,  and  even  if  it  was,  there  is  no  easy  way  of  finding  or 
implementing  the  ’optimal*  predictor,  two  simplifications  will  be  made.  First,  linear  predictors  will  be  used,  and  second,  the 
mean  squared  error  will  be  used  instead  of  entropy  as  a  performance  measure.  In  other  words,  minimize  the  expected  value  of 
the  error  (MSB)  of  the  linear  predictor; 

€i=bi-y,  (22) 

where  e,  is  the  prediction  error,  6/  is  the  spectral  value  of  band  i,  and  y#  is  the  output  of  the  predictor.  The  linear  predictor  is 
selected  to  minimize 

E[bi-yi\S},  (23) 


17 


where  S  is  the  set  of  bands  used  for  prediction.  Of  course,  since  the  purpose  is  to  have  the  minimum  entropy,  the  results  will 
be  compared  to  using  entropies  when  possible. 

4,6  Decision  Fusion  For  Hyperspectral  Data  (Dr.  Luis  O.  Jimenez) 

Decision  fusion  is  one  of  the  most  widely  used  procedures  of  data  fusion.  Different  classification  methodologies  are  used  and 
a  local  decision  is  performed  at  each  one  of  them.  These  decisions  are  combined  in  a  Decision  Fusion  Center  [Klein].  The 
Decision  Fusion  Center  has  a  set  of  algorithms  to  integrate  the  individual  and  local  decisions  of  each  sensor.  The  algorithms 
are  based  on  different  techniques  such  as  Boolean  logic,  voting  schemes,  Fuzzy  logic,  statistical  approaches  and  Neural 
Network.  One  of  the  most  common  approaches  used  for  decision  fusion  is  the  Parallel  Fusion  Network  [Iyengar  et  al].  The 
classification  systems  observe  common  patterns.  These  systems  do  not  communicate  with  each  other  but  they  provide 
complementary  information  about  the  pattern  to  be  classified.  A  set  of  classification  system  is  complementary  if  each  one 
observes  some  information  that  the  others  do  not.  From  an  image  processing  perspective,  each  pixel  is  a  pattern  that  will  be 
discriminated. 

4.6.1  Unsupervised  Decision  Fusion  for  Hyperspectral  Data 

Most  of  the  high  dimensional  vector  space  of  hyperspectral  data  is  empty.  Over  fitting  of  the  data  occurs  due  to  the  lack  of 
enough  data  to  estimate  well  the  parameters  in  a  high  dimensional  space.  In  order  to  eliminate  some  of  the  redundancy,  band 
subset  selection  algorithms  are  applied.  By  taking  different  subsets  of  bands  of  a  hyperspectral  image,  applying  the  clustering 
algorithm  to  each  one  and  fusing  the  results  we  might  be  able  to  use  the  data  in  all  its  spectral  range  without  losing  valuable 
information,  and  at  the  same  time  dealing  with  the  curse  of  dimensionality.  We  exploited  the  fact  that  hyperspectral  sensors 
produce  redundant  information  by  grouping  bands  in  subsets  to  produce  local  classifications  at  every  group.  A  criterion  to 
group  the  bands  can  be  to  choose  the  most  independent  subsets.  Figure  2  shows  that  scheme  of  local  classifications  at  every 
subset. 

The  number  of  bands  in  the  local  classifications  should  be  high  enough  to  have  good  classification.  At  the  same  time  the 
number  of  bands,  that  is  the  dimensionality  of  the  local  classification,  should  be  made  adequate  enough  to  increase  the  ratio 
of  the  number  of  unlabeled  samples  per  number  of  bands.  That  ratio  will  produce  better  estimation  of  the  parameters  required 
for  first  and  second  order  statistics.  Those  estimations  will  exploit  the  amount  of  information  hyperspectral  data  have, 
avoiding  some  of  the  problems  of  high  dimensionality  previously  outlined. 

Let  X  be  the  vector  that  contains  all  the  output  of  die  hyperspectral  sensor  at  every  band.  It  has  the  measurement  of  all  d 
bands  in  the  sensor  data.  The  vector  X  is  subdivided  in  sub-vectors  Xj  that  contain  the  measurements  of  the  i*^  subset  of  bands 
that  will  be  used  in  local  classifications.  Its  structure  is  as  follows: 

X  =  [x,  XNr.whe«Xi=[xu  -  Xw.r 
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where  there  are  N  groups  of  bands.  The  group  of  bands  is  constructed  following  the  band  subset  selection  algorithm 

N 

explained  before,  dj  is  the  number  of  bands  in  the  i***  group,  with  the  property  that^dj  =  d*  <  d.  Where  d  is  the  total 

i=I 

number  of  bands.  We  can  now  feasibly  search  subsets  of  bands  that  are  mostly  independent  and  will  provide  acceptable  local 
unsupervised  classification. 


These  subsets  of  bands  are  fed  to  a  clustering  algorithm.  The  results  of  each  one  of  them  are  classification  maps,  uj  that  are 
fused  in  a  classification  center,  as  shown  in  Figure  2. 


X  _ , 

Clustering  1 

^l,dl  ' 

*2.1  T 

Ciustering  2 

*2.d2 

• 

*N.l 

Clustering  N 

VdN 

Figure  2.  Unsupervised  classification  and  decision  fusion  scheme  for  hyperspectral  data. 

How  are  the  subsets  of  bands  chosen?  First  we  choose  the  d|  most  independent  according  to  the  algorithm  of  band  subset 
selection.  This  will  construct  the  subset  1  of  di  bands.  We  applied  a  clustering  algorithm  on  it  that  will  result  in  a 
classification  ui.  Then  we  applied  the  band  subset  selection  algorithm  again  in  the  remaining  d-di  bands  and  choose  the  da 
mostly  independent  bands  for  the  second  subset  ua  is  the  result  of  applying  the  unsupervised  clustering  algorithm  to  that 
particular  subset.  We  continue  subsequently  until  we  have  the  appropriate  number  N  of  groups  of  subsets  of  bands.  U}  could 
be  seen  as  the  classification  result  of  the  pixel  X  in  the  i*^  subset  of  bands. 

Several  ways  of  using  data  fusion  in  hyperspectral  unsupervised  classification  were  studied  in  this  research  project.  These 
results  helped  us  in  our  understanding  of  hyperspectral  imaging. 
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4.6.2  Classification  Fusion  Center 


All  the  decision  fusion  mechanisms  discussed  in  this  section  are  based  on  a  pixel-by-pixel  basis.  One  of  the  most  known 
fusion  techniques  is  majority  voting.  This  scheme  fuses  classification  based  on  pixel  classification  with  most  occurrences. 
That  is,  the  class  with  major  incidence  will  decide  the  cluster  of  the  pixel  being  analyzed.  Other  decision  fusion  is  weighted 
majority  voting.  This  method  uses  the  same  principle  of  majority  voting  but  adds  weights  to  the  fused  pixels.  These  weights 
are  based  on  the  degree  of  independency  of  the  band  subsets.  As  mentioned  before  each  subset  of  bands  selected  has  a  degree 
of  independence.  The  first  subset  correspond  to  the  d|  bands  more  independents.  The  second  subset  corresponds  to  the  subset 
that  has  the  second  largest  degree  of  independence.  The  idea  is  to  bias  decision  to  the  clustering  results  of  the  most 
independent  subset  of  bands,  since  its  classifications  are  based  on  more  information  content.  Squared  weights  are  also  applied 
to  bias  even  more  the  decision  fusion  towards  the  most  independent  bands  classifications. 

Other  methods  tested  are  based  on  pixel  probability  density  functions  (pdf)  estimations,  taking  into  consideration  the 
estimation  of  the  density  distribution.  The  first  pdf  method  tested  was  maximum  pdf  fusion.  The  fused  image  is  obtained  by 
getting  the  maximum  pdf,  pixel  by  pixel,  for  each  classification,  and  then  classifying  to  the  pixel  with  greater  pdf  value.  The 
second  pdf  method  is  achieved  by  calculating  pdf  mean  per  pixel,  and  classifying  each  pixel  to  the  corresponding 
classification  of  the  pdf  mean  value.  Assuming  f(X/wk)  is  the  probability  density  fimetion  of  pixel  X  given  class  k,  and  [xi, 
X2,  X3,...,Xn]  is  the  pixel  X  at  subsets  i  to  N,  the  decision  fusion  rules  for  M  classes  can  be  summarized  in  the  following  way 
[Kittler][Jimenez,  Morales,  Creus]: 


4.6.2.I.  Majority  voting, 
Let 


f(x/Wj)>f(Xi/w,),  Vj 
else 


(25) 


Then  assign  pixel  X  to  the  k*  class  if, 

Vj.  (26) 

i=l  i=I 

4,6.2.2.  Max  pdf. 

Assign  pixel  X  to  the  k*  class  if, 

max[f(x/w  j]  =  max|max[f(x/Wj)]|,  (27) 

i*l  i*l  I  J 


4.6.2.3.  Max  average, 

Assign  pixel  X  to  the  k***  class  if. 
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«  9 


(28) 


4.6.2  A,  Linearly  weighted  majority  voting, 

Let 

(N-i+l),  f(Xi/w,)>f(x/wJ.  V; 

Cu  =  -| 

0,  else 

Assign  pixel  X  to  the  class  if, 

n 

t=:|  i*l 

4.6.2.5.  Quadratic  weighted  majority  voting, 

Let 

(N-i+l)*,  f(Xi/w=)>f{Xi/Wk),  Vj 

c«= 

0,  else 

Assign  pixel  X  to  the  k*  class  if, 

i=l  i=l 


(29) 


(30) 


(31) 


(32) 


5.  Summary  of  Most  Important  Results 

This  section  presents  the  most  important  results  of  the  research  accomplished  in  this  project.  The  first  section  is  related  with 
clustering  results  using  Fuzzy  C-Means  Clustering,  Neural  Networic  Clustering  and  Fuzzy  Neural  Networic  Clustering 
methods  applied  to  hyperspectral  data.  Section  2.2  has  results  of  applying  the  regularization  method  to  enhance  the 
covariance  estimation.  It  shows  the  effect  of  regularization  in  the  detection  of  outliers.  Section  2.3  presents  results  of  the 
Band  Subset  Selection  mechanism.  Section  2.4  presents  the  results  of  the  compression  algorithm  and  finally  the  Decision 
Fusion  mechanism  results  are  shown  in  section  2.5. 


5.1  Clustering  Results 

This  section  shows  the  results  of  our  earliest  experiments  of  the  effect  of  dimensionality  on  the  results  of  Fuzzy  C  Means 
clustering,  Neural  Network  clustering  and  Fuzzy  Neural  Network  clustering.  The  algorithms  were  applied  on  a  segment  of  an 
AVIRIS  frame  with  220  bands.  Figure  3  shows  the  result  of  the  Fuzzy  C  means  in  high  dimensionality.  Figure  4  and  5  show 
the  results  for  NN  Clustering  and  Fuzzy  Neural  Network  clustering  respectively  under  the  same  conditions. 
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Figure  3.  Fuzzy  C-Means  Clustering  algorithm  result 


Figure  4.  Neural  Network  Clustering  algorithm  result  Figure  5.  Fuzzy-Neural  Network  Clustering  algorithm  result. 

According  to  these  preliminar  results  we  could  observe  that  Fuzzy  C-means  is  a  more  stable  algorithm  under  high 
dimensionality  than  the  other  two.  It  is  able  to  detect  more  patterns  that  are  more  accord  to  the  a  priori  information  of  a 
testbed. 

5.2  Use  of  regularization  for  parameter  estimation  process 

This  section  presents  the  result  of  using  a  regularized  covariance  matrix  estimation  versus  using  the  maximum  Likelihood 
Estimate  with  a  relative  small  number  of  training  samples.  Two  hundred  of  training  samples  and  two  huiKlrcd  of  testing 
samples  from  the  same  soil  class  were  used.  They  were  obtained  from  a  segment  of  a  frame  from  the  AVIRIS  sensor  with  220 
bands.  Figure  6a  shows  the  number  of  outliers  in  the  set  of  200  training  samples  (in  green)  and  the  number  of  outliers  in  the 
set  of 200  testing  samples  (in  red)  when  using  the  Maximum  Likelihood  estimate  of  the  covariance  matrix  in  equation  (13).  It 
is  shown  that  the  number  of  outliers  increases  with  the  increment  of  the  number  of  bands.  Figure  6b  shows  the  number  of 
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outliers  when  the  regularize  covariance  matrix  was  used  for  a  small  y  (”  *02).  Obviously  the  number  of  outliers  in  the  test 
samples  with  the  increment  of  bands  were  reduce  using  a  regularization  method. 


l^i^Yihorcfbancfe 


Figure  6a.  Number  of  outliers  versus  number  of  bands  using 


Figure  6b,  Number  of  outliers  versus  number  of  bands  using  with  Y  =  .02 

53  Band  Subset  Selection  Results 

To  evaluate  the  performance  of  the  heuristic  algorithms  in  approximating  the  optimal  solution  for  the  subset  selection 
problem,  several  numerical  experiments  were  conducted  with  data  from  the  LANDSAT  TM  simulator  and  data  from  the 
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AVRIRIS  sensor  from  the  NW  Indian  pine  test  site.  LANDSAT  data  was  used  since  with  only  7  bands  the  subset  selection 
problem  can  be  solved  exactly  for  all  possible  combinations  and  subset  sizes.  With  the  AVIRIS  data  set,  we  only  show  results 
for  subsets  of  1  and  2  bands.  All  computations  were  done  using  the  MATLAB™  environment. 

5.3.1  LANDSAT  TM  Simulator  Example 

The  band  selection  methods  were  used  for  band  selection  using  a  multispectral  image  taken  over  the  town  of  Anasco  in 
Puerto  Rico  by  the  LANDSAT  TM  simulator.  Since  it  has  only  7  bands*  all  combinatorial  problems  associated  with  band 
selection  can  be  solved  exactly  and  allow  us  to  better  analyze  the  results  from  our  band  selection  algorithms. 

5.3. 1 . 1  Approximating  Principal  Components 

The  first  comparison  is  to  evaluate  how  well  the  selected  bands  approximate  the  corresponding  principal  components.  Table 
1  shows  the  singular  values  for  the  LANDSAT  Image  and  the  individual  and  cumulative  percentage  of  variability.  Notice 
that  with  three  PCs  we  have  over  99%  of  the  total  variability.  A  large  gap  can  be  identified  between  the  second  and  third 
singular  values.  Table  2  shows  the  bands  most  aligned  with  the  principal  components  and  the  corresponding  canonical 
correlation.  The  canonical  correlation  is  equal  to  the  cosine  of  the  angle  between  the  subspaces.  Tables  3  and  4  show  the 
bands  selected  using  SVD  and  the  RRQR  algorithms  and  the  corresponding  correlation.  Notice  that  the  bands  selected  using 
the  SVD  approach  are  always  more  correlated  to  the  PCs  than  those  selected  using  the  RRQR  factorization.  Notice  that  both 
algorithms  did  very  well  when  selecting  two  bands  due  to  the  gap  between  the  second  and  third  singular  values.  We 
highlighted  the  three-band  case  because  in  the  next  section  we  study  it  for  class  discrimination. 


Table  1 ;  Summary  of  Singular  Values  and  Corresponding 
Variability  for  LANDSAT  Image. 


Table  2:  Canonical  Correlation  of  the  Optimal  Bands  with 
the  Corresponding  Principal  Components  for  LANDSAT 


p 

Selected 

Bands 

Canonical 

Correlation 

1 

{5} 

0.8905 

2 

{3.5} 

0.9963 

T 

{1.4.5} 

0.9147 

4 

{1.3.5, 7} 

0.9892 

“T 

{iA4.6.7} 

0.9902 

6 

{1.2,3.4.5.7} 

0.8613 

PCi 

%Vari^ 

%Var 

1 

18,981 

60.69 

60.69 

2 

14,792 

36.86 

97.55 

3 

3,176 

1.70 

99.25 

4 

1,876 

0.55 

99.80 

5 

837 

0.12 

99.92 

6 

517 

0.05 

99.96 

7 

469 

0.04 

100.00 
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Table  3:  Canonical  Correlation  between  the  Principal 
Components  and  the  bands  selected  using  S  VD. 


Table  4:  Canonical  Correlation  between  the  Principal 
Components  and  the  bands  selected  using  RRQR. 


1 

Selected  Bands 

Canonical 

1 

Correlation 

1 

{6} 

0.7863 

2 

{3.6} 

0.9962 

3 

{1,3,6} 

0.8672 

4 

{U,6,7} 

0.9890 

5 

{1,2, 3,6,7} 

0.9314 

6 

{U.3,4,6.7} 

0.7961 

1 

Selected  Bands 

Canonical 

1 

Correlation 

I 

{6} 

0.7863 

2 

{3.6} 

0.9962 

B 

{1.4,6} 

0.9136 

4 

{1.3,6, 7} 

0.9890 

5 

{U.4.6,7} 

0.9902 

6 

{1,2,3.4,6.7} 

0.7961 

5.3. 1.2  Class  Discrimination  Performance 

Our  interest  in  looking  at  band  selection  based  on  matrix  factorization  was  to  use  it  as  the  feature  reduction  pre-processor  in  a 
classifier  as  shown  in  Figure  7.  The  performance  of  the  classifier  depends  on  separability  among  classes  so  we  would  like  to 
select  those  based  that  result  in  higher  separability.  In  this  section,  we  show  how  well  our  band  selection  algorithm  did  in 
selecting  bands  with  good  discrimination  performance.  In  the  experiment  presented,  we  will  select  3  bands,  C  means  with 
covariance  was  used  as  the  clustering  method,  the  initialization  method  was  based  on  eigenvalues  and  eigenvectors,  and  there 
were  a  total  of  8  clusters  identified.  Once  the  data  was  grouped  in  eight  clusters,  we  proceed  to  select  the  three  bands  that 
gave  the  best  class  separability  in  the  reduced  dimension  space.  Four  criterion  were  used  to  evaluate  class  separability: 
Bhattacharyya  (B),  Jeffreys-Matusita  (JM),  Divergence  (D),  and  Transformed  Divergence  (TD)  distances  [Chandrasekaran 


and  Ipsen]. 


Bhattachaiyya  Distance 


2,  +  2; 

2 

Jeffreys-Matusita  Distance 


JMy  =2{l-e 

where  By  is  the  Bhattacharyya  Distance 

•  Divergence 

=  i  rr{(s,  -  s,)(x;'  -  27')}+  i  7’r[(S7'  -  -  MjIm,  -  MjY  } 


•  Transformed  Divergence 


7X),  =2  1-e  8 


where  Djj  is  the  divergence. 


Since  evaluation  of  these  distances  required  evaluating  the  distance  among  several  classes,  we  show  results  for  selecting 
bands  to  maximize  average  distance  and  maximize  the  minimum  distance.  The  results  are  tabulated  in  Table  5.  For  the  3- 
band  case,  the  combination  selected  using  SVD  was  {1,4,6}  while  using  RRQR  factorization  was  {1,3,6}.  The  bands 
selected  using  SVD  were  the  optimal  for  maximizing  the  minimum  distance  while  in  the  average  case  it  was  in  the  top  5 
combinations  for  most  costs.  The  bands  selected  using  RRQR  was  number  6  in  maximizing  the  minimum  distance  but  not  in 
the  top  10  for  the  average  case.  This  result  points  to  the  better  performance  of  the  SVD  over  the  RRQR  algorithm  and  to  the 
fact  that  it  also  gives  bands  with  good  discrimination  capacity. 


Table  S:  Ranking  of  Solutions  Obtained  using  SVD  and  RRQR  Algorithms  (Ranking  =  I  is  Optimal  Solution). 


Distance 

Ranking  of  Bands  Selected  using  SVD 

Ranking  of  Bands  Selected  using  RRQR 

Minimum  distance 

Average  Distance 

Minimum  distance 

Average  Distance 

Bhattacharyya 

1 

5 

6 

21 

Jeffreys-Matusita 

1 

5 

6 

21 

Divergence 

» 

19 

6 

26 

Transformed  Divergence 

I 

3 

6 

17 

5.3.2  AVIRIS  Example 

The  subset  selection  method  was  applied  to  the  analysis  of  the  AVIRIS  data  set  from  the  Northwest  Indian  Pine  Test  site  in 
Indiana.  Only  computations  to  compare  with  principal  components  were  done  here.  Since  the  AVIRIS  sensor  has  220  bands 
solving  the  combinatorial  problems  that  arise  in  determining  the  optimal  discriminating  bands  require  use  of  search  methods 
and  we  are  currently  working  on  their  implementation  to  use  in  analysis  of  this  property  in  hyperspectral  data.  To  give  you  an 
idea  of  the  difficulties  that  arise  in  solving  the  exact  problem  for  the  AVIRIS  case,  if  2  bands  are  selected  there  are  24,090 
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possible  combinations  of  220  bands,  if  3  are  selected  there  are  1,750,540  possible  combinations,  and  for  4  bands  there  are 
94,966,795  possible  combinations. 


Figure  8  shows  the  variances  of  the  first  100  principal  components.  Only  in  the  first  10  principal  components  one  can  identify 
gaps.  This  will  prove  key  in  the  performance  of  the  RRQR  algorithm  as  we  shall  see  later.  The  canonical  correlation  of  the 
selected  bands  with  the  principal  components  for  bands  selected  using  SVD  and  RRQR  is  shown  in  Figure  9.  Notice  that 
after  9  bands,  the  RRQR  correlation  behave  very  erratically  and  mostly  low  correlation.  The  SVD  approach  shows  a  more 
stable  behavior  although  degraded  due  to  the  presence  of  no  gaps  after  the  lO"’  singular  value.  The  canonical  correlation  for 
the  SVD  bands  is  always  higher  that  those  for  RRQR  as  seen  in  the  LANDSAT  case. 


Figure  8.  First  100  Eigenvalues  of  the  Covariance  Matrix  for  the  AVIRIS  Indian  Pine  Test  Image. 


Figure  10  shows  the  number  of  flops  taken  by  SVD  vs.  RRQR.  As  expected  RRQR  took  fewer  flops.  In  the  SVD  approach 
the  bulk  of  the  computation  goes  into  computing  the  singular  value  decomposition  of  the  image.  The  QR  factorization  of  the 
V,  matrix  of  the  first  p  right  singular  vectors  is  negligible.  The  RRQR  is  computed  sequentially  so  It  is  worthy  to  point  out 
that  SVD  is  implemented  internally  in  MATLAB  and  optimized  by  MATLAB  staff.  We  used  a  RRQR  routine  not  well 
optimized  and  still  were  able  to  get  better  performance. 


Canonical 

Correlation 


Figure  9.  Canonical  correlation  analysis  of  the  selected  bands:  (a)  canonical  correlation  for  SVD  and  RRQR,  (b)  Singular 
Values. 
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Number  of  Flops:  SVDSS  Vs.  LRRQR 


Figure  10.  Number  of  flops  taken  by  RRQR  and  SVD  band  selection  algorithms. 

5.3.3  Final  Comments  on  Comparison  of  SVD  and  RRQR  Band  Selection  Algorithms 

In  these  sections,  we  presented  a  theoretical  basis  for  hyperspectral  image  band  selection  using  SVD  and  RRQR  factorization 
matrix  decompositions  and  compared  their  performance  in  their  capacity  to  approximate  principal  components  and  select 
bands  with  good  class  discrimination  properties.  Numerical  experiments  using  LANDSAT  and  AVIRIS  images  illustrate  the 
good  performance  of  the  proposed  approach  for  band  selection  and  its  closeness  to  the  optimal  solution.  Band  selection  is 
inherently  a  combinatorial  optimization  problem  and  the  proposed  band  selection  methods  compute  a  reasonable  solution  as 
the  experiment  showed  in  polynomial  time. 

5 .3 .4  Comparison  to  Other  Approaches 

In  this  section,  we  study  how  well  our  band  selection  algorithm  based  on  SVD  subset  selection  compares  to  other  approaches 
proposed  in  the  literature  for  band  selection  in  hyperspectral  imagery  based  on  principal  component  analysis.  We  briefly 
introduce  the  approaches  and  present  experimental  results  comparing  all  approaches  using  AVIRIS  data. 

5.3.4. 1  Sensitivity-Based  Band  Selection 

The  first  two  approaches  studied  take  advantage  of  the  fact  that  the  absolute  sensitivity  of  the  variance  of  the  i-th  principal 
component  to  the  variance  of  the  j-th  band  is  given  by  [Golub  and  Van  Loan] 

9 

where  Xj  is  the  variance  of  the  i-th  principal  component  and  Vij  is  the  j-th  entry  of  the  i-th  right  singular  (or  loading)  vector. 
The  two  algorithms  differ  in  the  way  this  sensitivity  information  is  used.  The  first  algorithm  select  those  bands  for  which  the 
variances  of  the  first  p-principal  components  are  most  sensitive.  The  second  algorithm  rejects  those  bands  for  which  the 
variance  of  the  smallest  n-p  principal  components  are  most  sensitive.  The  two  algorithms  are  summarized  next. 

Algorithm:  Forward  Sensitivity  Subset  Selection(FS^) 

1 .  Construct  a  matrix  representation  X  of  the  hyperspectral  image.  Each  column  of  X  is  constructed  by  stacking 
the  columns  of  each  single-band  image. 
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2.  Subtract  the  mean  from  each  band. 

3.  Compute  the  SVD  of  X. 

4.  Set  p  =  number  of  bands  to  select  and 

5.  fori- 1  top, 

i.  Select  band  ji  if 

1.  ;;  =  arg  max  kj 

2.  K=K+{ji} 

ii.  where  Vi  is  the  i-th  loading  vector  for  X. 

6.  SetX,=[xj,  |Xj,  -  XjJ  where jiSK. 

Algorithm:  Backward  Sensitivity  Subset  Selection (BS^) 

1.  Construct  a  matrix  representation  X  of  the  hyperspectra!  image.  Each  column  of  X  is  constructed  by  stacking 
the  columns  of  each  single-band  image. 

2.  Subtract  the  mean  from  each  band. 

3.  Compute  the  SVD  of  X. 

4.  Set  p  =  number  of  bands  to  select  and  K={  1,2,..  .,n} 

5.  for  i  =  p+1  to  n, 

Reject  band  ji  if 

Ji  =  ^n 

K=K-{ji} 

where  Vi  is  the  i-th  loading  vector  for  X. 

6.  SetX,=[xj,  Ix^,  -  XjJ  where jieK. 

Algorithms  FS^  and  BS^  correspond  to  algorithms  B4  and  B2  described  in  [JoIlife][King  and  Jackson]  respectively. 
Algorithm  BS^  was  applied  to  band  subset  selection  in  [Pasztor  et  al][CsiIIag  et  al], 

5. 3.4.2  Principal-Component-Regression  Band  Subset  Selection 

This  approach  is  based  on  linear  regression  of  the  HSI  bands  on  the  principal  components.  This  approach  in  presented  as 
algorithm  B3  in  [Jollife][King  and  Jackson]. 

Algorithm:  PC  Regression  Subset  Selection (PCRS^) 

1 .  Construct  a  matrix  representation  X  of  the  hyperspectral  image.  Each  column  of  X  is  constructed  by  stacking 
the  columns  of  each  single-band  image. 

2.  Subtract  the  mean  from  each  band. 

3.  Compute  the  SVD  of  X. 
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4.  for  i  =  1  to  n 


,=|(i-u,u;)x,|, 


5.  Set  p  =  number  of  bands  to  select  and  K=(l) 

6.  for  i  =  1  to  p, 

Select  bandji  if 


:arg  max  r. 


K=K+{jj} 

7.  Setx,  =[xj,  I  •••  Xj^]  where jiSK. 


This  algorithm  is  kind  of  a  greedy  strategy  for  band  selection.  As  we  shall  see  later  its  main  defect  is  that  it  does  not  take  into 
consideration  the  interdependencies  of  the  HSI  bands.  This  is  because  two  bands  with  small  sums  of  squares  rj  maybe  highly 
dependent  but  based  on  this  strategy  both  bands  could  be  selected  by  the  algorithm. 

5.3.43  Experimental  Results 

To  compare  the  performance  of  the  algorithms  in  approximating  the  principal  components  several  numerical  experiments 
were  conducted  using  an  AVIRIS  data  set  of  the  Cuprite  Mining  District,  Nevada  taken  in  1995.  This  image  has  the 
follovring  characteristics:  400  samples,  350  lines,  and  50  bands  in  the  spectral  range  of  1 .99080  to  2.47900  ^m  (bands:  1 72  to 
221).  Although  the  AVIRIS  sensor  provides  information  in  220  bands,  Cuprite  95  is  a  subset  of  50  bands.  A  view  of  this 
image  is  shown  in  Figure  10.  The  selected  spectral  range  is  useful  for  mineral  discrimination.  Here,  the  size  of  the  image  was 
manageable  and  allowed  us  to  find  the  subset  of  bands  with  the  smallest  canonical  correlation  with  the  principal  components 
for  comparison  purposes. 

Figure  12  shows  the  value  of  the  singular  values  for  the  CUPRITE  image.  We  can  see  that  only  in  the  first  7  principal 
components  one  can  identify  gaps.  Also  97%  of  the  total  variability  is  summarized  by  the  first  10  principal  components. 
Because  of  the  image  small  size,  we  were  capable  of  computing  the  optimal  solution  to  the  band  subset  selection  with  the 
largest  canonical  correlation  for  subsets  up  to  6  bands  and  this  is  shown  in  Table  6.  The  canonical  correlations  of  the  selected 
bands  with  the  principal  components  for  bands  selected  using  FS3,  BS3,  PCRS2,  and  SVDSS  are  shown  in  Tables  7  to  10 
respectively. 
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Figure  II.  1995  AVIRIS  Image  from  CUPRITE  Mining  District  Figure  12.  Singular  Values  Spectrum  for  the  CUPRITE  95 

in  Nevada.  AVIRIS  Image. 


Table  6:  Canonical  Correlation  of  the  Optimal  Bands.  Table  8:  Canonical  Correlation  of  the  Bands  Selected  using  BS». 


I 

Selected  Bands 

Canonical 

Correlation 

D 

(24) 

0.933 

m 

{1.50} 

0.9223 

E 

{9,42.50} 

0.8101 

m 

{1.8,25.50} 

0.9402 

E 

(1.9.25.36.49} 

0.8905 

E 

(1.9.24.  36.42.50} 

0.9313 

B 

Selected  Bands 

Canonical 

Correlation 

D 

0.8818 

0.2518 

E! 

0.0721 

□ 

(20,25,27,36}  I 

0.0390 

■EEMfll 

IlfeKilEdlMlQiBI 

Table  7:  Canonical  Coirelation  oftfie  Bands  Selected  using  FS^  Table  9:  Canonical  Correlation  of  the  Bands  Seiected  using 

PCRSl 


P 

Selected  Bands 

Canonical 

Correlation 

T 

{50} 

0.4615 

2 

{46.50} 

0.0426 

2 

{46.49,50} 

0.2500 

4 

(46,  48, 49, 50} 

0.0604 

5 

(18.46.48.49.50} 

0.0817 

6 

(18.  19.46.48.49,50} 

0.0471 

P 

Selected  Bands 

Canonical 

Correlation 

i 

(24} 

0.9330 

2 

{24,25} 

0.2832 

3 

{24,33.34} 

0.0653 

4 

(24. 33, 34. 35} 

0.0106 

5 

(26.31.32. 33,50} 

0.0208 

6 

(18,19,20.32.33,50} 

0.0090 

Table  10:  Canonical  Correlation  of  the  Bands  Selected  using  SVDSS. 


P 

Selected  Bands 

Canonical 

Correlation 

T 

{50} 

0.4615 

2 

(6, 50} 

0.8934 

3 

(5,48,50} 

0.5875 

4 

(6,37,49,50} 

0.7232 

5 

{9,37,45,49,50} 

0.8501 

6 

(6,18,37,47,49,50} 

0.7825 
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Figure  13  shows  the  canonical  correlations  for  the  selection  of  up  to  20  band  subsets  for  all  four  algorithms.  Here  we 
do  not  show  the  solution  for  the  optimal  case  because  its  computation  requires  large  computational  requirements. 
We  obtain  similar  results  as  we  see  in  Tables  7-10  where  in  almost  all  cases  the  SVDSS  algorithm  selected  the 
bands  with  fhe  largest  canonical  correlation.  Furthermore,  notice  that  after  7  bands  there  are  no  gaps  in  the  singular 
value  spectrum  but  still  the  SVDSS  is  capable  of  obtaining  bands  with  more  than  60%  canonical  correlation  for  most 
cases.  This  robustness  is  further  studied  in  [Velez-Reyes  et  al,  Orlando  2000]. 

Algorithms  FS^  BS^  and  PCRS^  fail  in  getting  a  reasonable  approximation  to  the  principal  components.  SVDSS 
gets  the  best  approximations  to  the  principal  components  and,  as  shown  in  [Linares  2001],  the  results  shown  here  are 
within  the  top  3%  of  all  possibilities.  This  can  be  explained  by  studying  the  construction  of  the  algorithms. 
Algorithms  FS^  BS\  and  PCRS^  do  the  band  selection  by  looking  at  one  possibility  at  the  time.  As  we  already 
discussed  for  PCRS^  this  approach  ignores  the  interdependencies  of  the  bands.  The  SVDSS  approach  is  kind  of  a 
sensitivity-based  approach  to  band  selection  since  it  uses  the  loading  vectors,  Vj  associated  with  the  first  p  principal 
components.  Notice  that  the  j-th  row  of  Vi  is  the  sensitivity  of  the  variance  of  the  first  p  principal  components  to  the 

variance  of  the  j-th  band.  When  the  QR  factorization  with  pivoting  is  applied  to  V,^ ,  the  selection  of  pivots  [Golub 
and  Van  Loan]  is  such  that  it  will  select  a  group  of  highly  independent  rows  with  large  norm.  In  other  words,  it  tries 
to  select  bands  that  have  a  significant  effect  on  all  the  principal  components  (rows  of  large  magnitude)  but  their 
effects  on  the  principal  components  are  relatively  independent  (band  interdependencies). 


Figure  13:  Canonical  Correlation  of  Bands  Selected  using  SVD  and  other  PC-based  Methods. 
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5.3 .4.4  Comments  on  Comparison  with  other  Approaches 

In  this  section,  we  presented  a  comparison  of  three  methods  for  band  subset  selection  that  use  information  from  the 
HSI  principal  components  presented  in  the  literature  and  the  SVD  band  selection  method  developed  in  this  research. 
It  was  shown  that  successful  methods  for  band  subset  selection  using  principal  components  should  take  into 
consideration  the  interdependencies  across  bands.  Numerical  experiments  with  the  CUPRITE  1995  AVIRIS  HSI  are 
used  to  illustrate  this  point  The  SVDSS  gave  the  best  performance  of  all  algorithms.  This  algorithm  also  shows 
robust  performance  even  in  cases  where  the  gap  condition  needed  for  high  canonical  correlation  was  not  met. 

5.4  Compression  Method  for  hyperspectral  data 

The  first  part  of  the  research  was  to  investigate  the  correlation  properties  of  hyperspectral  data.  Next,  this 
information  was  used  to  develop  lossless  compression  algorithms.  The  correlation  properties  of  hyperspectral 
images  depend  on  the  sensor  used.  The  research  here  used  AVIRIS  images.  These  images  consist  of  224  contiguous 
spectral  bands,  from  400nm  to  2500nm,  each  about  lOnm  wide.  The  correlation,  or  more  appropriately,  the 
statistical  relation  between  bands  was  studied  in  a  number  of  ways.  The  correlation  between  the  bands  is  shown  in 
Figure  14.  Since  this  information  is  going  to  be  used  for  developing  compression  algorithms,  it  is  more  appropriate 
to  look  at  the  magnitude  of  the  correlaUon.  When  used  for  prediction,  a  correlation  of  .7  and  -,7  is  the  same.  Figure 
15  shows  the  same  data  as  Figure  14.  but  with  magnitude  only.  Some  important  results  are  clearly  visible  in  the 
figures.  The  spectral  bands  near  MOOnrn  and  1900nm  go  almost  to  zero  because  of  atmospheric  absorption.  Because 
these  bands  have  little  or  no  ground  data,  they  will  have  low  correlation  to  other  bands,  and  even  between 
themselves.  These  bands,  starting  at  about  1 10  and  150  in  the  figures,  will  be  difficult  to  compress,  because  they  are 
mostly  noise,  and  are  not  useful  in  predicting  other  bands. 


Band  1 

Figure  14.  Correlation  coefficients  of  AVIRIS  data 
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Hgure  15,  Magnitude  ot  Correlation  coetticients  ot  A  vlKli>  data 


On  the  other  hand,  we  can  see  that  there  are  sets  of  bands  that  have  high  correlation.  The  bands  from  I  to  36,  37  to 
80,  81  to  103,  109  to  149  and  164  to  219  all  have  high  correlation  between  them.  Also,  bands  from  1  to  36  have  a 
relatively  high  correlation  to  the  bands  from  109  to  149  and  164  to  219.  The  bands  from  37  to  80  have  high 
correlation  among  them,  but  do  not  have  a  high  correlation  to  any  other  bands.  The  same  is  true  for  the  bands  from 
81  to  109.  It  would  seem  reasonable  then,  that  if  we  were  to  select  bands  for  predicting  the  others,  there  would  be 
bands  from  the  set  1-36, 37-80,  and  81-109.  As  shown  below,  this  is  exactly  what  the  band  selection  algorithm  does. 

The  correlation  of  the  bands  is  being  investigated  for  use  in  lossless  compression  algorithms,  where  reduction  of 
redundancy  is  the  goal.  In  lossless  compression  work,  some  bands  are  used  to  predict  others,  get  a  prediction  error, 
then  code  the  error.  In  work  done  on  reducing  spectral  redundancy  presented  in  the  literature[Wu  and  Memon], 
typically  only  one  band  is  used  to  predict  the  adjacent  band.  To  see  the  performance  of  this  approach,  the  correlation 
of  adjacent  bands  is  shown  in  Figure  16.  For  comparison,  the  average  correlation  coefficient  for  each  band  with 
respect  to  the  others  is  shown  in  Figure  17. 

Correlation  coefficients  give  a  good  estimate  of  how  well  bands  can  predict  others,  but  what  we  really  want  to  know 
is  the  information  content  of  some  bands  given  others.  This  will  tell  us  how  many  bits  are  needed  to  code  the  data, 
and  so  how  much  compression  can  be  obtained.  This  information  content  is  given  by  the  entropy,  another  statistical 
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measure,  more  amenable  to  compression  work.  The  entropy  of  a  band  calculated  with  log  base  2  is  the  information 
in  bits  per  pixel.  The  conditional  entropy  is  the  information  content  of  a  band  given  that  another  band  is  known.  In 
other  words,  the  conditional  entropy  is  the  number  of  bits  needed  to  code  each  prediction  error  if  a  perfect  predictor 
were  available.  These  results  are  shown  in  Figure  18.  The  x  axis  is  the  given  band,  and  the  y  axis  is  the  average 
entropy  of  all  other  bands. 


.Figure  16.  Correlation  coefficients  of  adjacent  bands  in  AVIRIS  data 
e 


Figure  17,  Average  Correlation  coefficient  of  each  band  in  AVIRIS  data 
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It  can  be  seen  that  the  bands  close  to  30  and  40  are  the  best  bands  to  use  for  prediction.  Specifically,  bands  41, 42 
and  29  have  the  first  through  third  lowest  average  conditional  entropies.  If  we  were  going  to  use  one  band  for 
prediction,  it  would  probably  be  band  41.  Notice  that  we  would  probably  choose  a  band  in  the  set  1-36  if  we  made 
the  choice  based  on  correlation  coefficients  (see  Figure  17). 


n 

e 


Figure  18.  Average  conditional  entropy  of  AVIRIS  data 

It  would  be  nice  to  have  conditional  entropies  conditioned  on  more  than  one  band,  but  the  calculation  of  this  is  too 
computationally  intensive.  To  calculate  the  conditional  entropies  in  Figure  18.  conditional  probabilities  were 
computed.  Since  AVIRIS  data  has  14  bit  precision,  each  data  can  have  2“= 16.384  possibilities,  and  there  are 
16,384^=268,345,456  probabilities  to  be  calculated.  Conditional  entropies  conditioned  on  two  bands  would  need 
16,384’=4.398'-  probabilities  calculated.  Since  this  is  not  possible,  different  simplifications  were  studied.  The  first 
simplification  was  to  use  a  linear  predictor  to  calculate  the  prediction  error,  then  calculate  the  entropy  of  the  error. 
This  can  be  use  for  a  limited  number  of  bands,  but  when  using  more  than  two  or  three  bands  for  the  prediction,  the 
number  of  combinations  used  becomes  unwieldy.  The  number  of  combinations  is: 

One  band :  224^=50,176  combinations 
Two  bands  :  5,299,800  combinations 
Three  bands :  385,118,800  combinations 
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Using  the  results  from  the  previous  year,  we  can  base  our  compression  work  on  look  for  combinations  of  bands  that 
will  be  good  spectral  predictors,  or  bands  that  can  predict  well  the  remaining  bands.  The  initial  idea  was  to  use 
nonlinear  predictors  for  the  data.  This  turned  out  to  be  a  bad  idea  for  two  reasons.  First,  the  nonlinear  predictors  are 
difficult  to  calculate,  and  lead  to  slow  compression  and  decompression.  Second,  as  shown  in  [Hunt],  the  nonlinear 
predictor  converges  to  a  linear  predictor  when  predicting  large  amounts  of  data.  In  this  case,  there  is  little  advantage 
to  using  the  nonlinear  predictors.  The  work  then  concentrated  on  band  selection  for  compression  using  linear 
predictors.  At  present,  algorithms  in  literature  use  one  band  for  predicting  the  adjacent  band.  The  compression  is 
done  sequentially,  one  band  at  a  time.  Because  the  bands  must  be  available  when  decompressing,  only  bands  that 
have  been  compressed  can  be  used  in  the  prediction.  Using  more  than  one  sequential  band  for  prediction  does  not 
increase  the  performance  enough  to  offset  the  increase  in  complexity.  This  strategy  achieves  acceptable 
performance,  but  the  compression  and  decompression  are  sequential.  In  many  applications  such  as  pattern 
recognition,  only  a  subset  of  the  total  number  of  bands  is  used.  This  proves  bad  for  the  compression  algorithms 
because  if  the  last  band  compressed  need  to  be  decompressed,  then  all  bands  must  be  decompressed.  The  work  here 
concentrated  on  finding  subsets  of  bands  that  would  equal  the  performance  of  using  sequential  bands.  In  this 
manner,  no  matter  which  band  needs  to  be  decompressed,  only  the  subset  of  bands  must  be  decompressed  first 
[Hunt,  and  Vdlezj. 

As  mentioned  above,  finding  a  good  subset  of  bands  is  a  combinatorial  problem.  Instead  of  going  through  an 
exhaustive  search,  the  research  here  used  algorithms  presented  by  V^lez  [Vdlez-Reyes  and  Jimenez],  to  quickly  find 
a  good  subset.  The  results  of  this  work  is  shown  below. 


Table  1 1 :  Fast  band  selection  results. 


Bands 

lin.  pred.  entropy 

SSE 

29 

7.830790 

243,796,908.522 

29-42 

7.262668 

31,454,577,590 

29-42-89 

6,776586 

13,437,745,488 

14-29-42-64 

6.620856 

8,834,423,644 

9-29-36-42-66 

6.543167 

7,293,346,614 

1-29-37-42-70-123 

6.046576 

6,382,762,255 
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Table  12:  Principal  Component  results. 


number  of  PC  used 

Hn,  pred.  entropy 

SSE 

1 

7.989636 

105,173,135,736 

2 

7.109786 

9,275.148,970 

3 

6.476988 

3,252,692,941 

4 

6.300056 

2,445,712,891 

5 

6.157913 

1,860.244,668 

6 

5.919479 

1,456,170,791 

Table  1 1  shows  the  entropy  error  of  using  from  one  to  six  bands  for  prediction.  Using  six  bands,  an  entropy  error  of 
6,05  was  achieved,  equaling  the  performance  of  using  one  band  to  predict  its  adjacent  band.  Table  12  shows  the 
error  entropy  and  SSE  using  the  principal  components  as  predictors.  This  was  done  as  a  comparison,  since  the 
principal  components  give  the  smallest  SSE  from  among  all  linear  combinations  of  bands. 

The  analysis  of  these  results  based  on  the  correlation  work  done  previously  is  interesting.  Above  it  was  shown  that 
the  band  with  the  lowest  conditional  entropy  is  band  41.  The  band  selector  here  chose  band  29,  This  is  reasonable  if 
we  go  back  to  the  correlation  coefficients.  Bands  from  1-36  have  high  correlation  between  them,  and  also  a  high 
correlation  to  the  bands  from  109-149  and  164-219.  The  bands  from  37-80  have  a  high  correlation  among  them,  but 
have  a  low  correlation  to  the  other  bands.  Based  on  this,  it  is  better  to  choose  a  band  in  the  set  1-36.  Also,  when 
choosing  two  bands,  the  algorithm  selects  29  and  42.  The  band  29  has  high  correlation  with  the  bands  mentioned 
above,  and  band  42  has  a  high  correlation  with  bands  in  the  set  37-80.  Going  back  to  Figure  15,  we  can  see  that  the 
only  set  of  bands,  besides  the  noise  bands,  that  do  not  have  a  high  correlation  with  bands  29  and  42  are  the  bands 
from  81  to  103.  In  table  1 1  we  can  see  that  when  three  bands  are  selected,  band  89  is  selected  along  with  bands  29 
and  42, 


5.5  Unsupervised  Decision  Fusion  Results 

This  section  shows  the  results  obtained  after  applying  the  decision  fusion  algorithms.  We  group  the  bands  in  10 
subsets  of  20  bands,  and  classify  10  clusters.  The  results  were  fused  using  the  five  mechanism  previously  discussed: 
majority  voting,  linear  weighted  voting,  square  weighted  voting,  maximum  pdf,  and  max  mean  pdf.  Figure  19  shows 
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a  frame  section  of  an  AVIRIS  image  of  Kennedy  Space  Flight  Center  in  Florida.  This  segment  shows  two  main 
areas  from  different  settings.  The  region  at  the  left  is  characterized  for  being  and  urban  area  and  the  one  at  the  right 
is  mainly  a  swamp  with  a  river  and  a  road.  Figures  20  to  24  show  the  classification  results  using  decision  fusion. 
Figures  20,  21,  and  22  show  the  voting  schemes  based  results:  majority  voting,  linear  weighted  voting  and  the 
square  weighted  voting  schemes.  They  produce  results  that  are  similar.  Figures  23  and  24  shows  the  methods  based 
on  probability  density  function:  maximum  pdf  and  max  mean  pdf.  They  are  similar  among  themselves  but  they  are 
different  with  respect  to  the  voting  schemes  in  the  first  group.  The  pdf  methods  are  able  to  classify  better  the  urban 
area  in  the  region  at  the  left  of  the  image.  The  spatial  structures  are  better  defined  in  that  classification  maps.  At  the 
same  time  the  pdf  based  fusion  mechanism  localized  some  urban  structures  in  the  swamp  region  at  the  right  of  the 
image.  Further  experiments  need  to  be  done  in  other  hyperspectral  images  in  order  to  detect  bias  in  the  decision 
fusion  algorithms.  This  understanding  will  help  us  modify  and  enhance  the  algorithm. 


Figure  1 9.  Segment  of  Original  Image  Figure  20.  Majority  Voting 


Figure  2 1 .  Linear  Weighted  Voting  Figure  22.  Square  Weighted  Voting 
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Figure  23.  Maximum  pdf 


Figure  24.  Max  Mean  pdf 
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